library(Seurat)
library(tidyverse)
library(symphony)
library(ggpubr)
library(patchwork)
library(RColorBrewer)
Install package from github
Set projection folder, Download reference object and UMAP model
# Load Symphony reference
ref <- readRDS(paste0(projection_path, 'BoneMarrow_RefMap_SymphonyRef.rds'))
Error in paste0(projection_path, "BoneMarrow_RefMap_SymphonyRef.rds") :
object 'projection_path' not found
If we want to visualize celltype labels or metadata from the BM Reference, we can create a Seurat Object from the symphony reference This will be memory efficient as it will not include gene expression counts, only the UMAP coordinates and the metadata including cell labels and sorting information
ReferenceSeuratObj <- create_ReferenceObject(ref)
Warning: No assay specified, setting assay as RNA by default.Warning: Keys should be one or more alphanumeric characters followed by an underscore, setting key from umap to umap_Warning: All keys should be one or more alphanumeric characters followed by an underscore '_', setting key to umap_
DimPlot(ReferenceSeuratObj, reduction = 'umap', group.by = 'CellType_Annotation_formatted',
raster=FALSE, label=TRUE, label.size = 4)
We can visualize other annotations too, including cell cycle phase and lineage pseudotime estimates.
p1 <- DimPlot(ReferenceSeuratObj, reduction = 'umap', group.by = 'CyclePhase', raster=FALSE)
p2 <- FeaturePlot(ReferenceSeuratObj, reduction = 'umap', features = 'Pseudotime', raster=FALSE)
p1 + p2
We have confidently mapped leukemias spanning AML (incl. AMKL and AEL), B-ALL, MPAL, MDS, CML, MPN, and BDPCN, across sequencing technologies. However, we cannot be confident in mapping T-ALL due to a lack of thymus-specific reference data on T cell precursor stages. Further, our tool does not discriminate between normal vs malignant cells, which is an important consideration particularly within low-blast count chronic diseases.
As an example here, we are going to project scRNA-seq data from three diverse AML patients sequenced in our study. - pt_17844: MLL-AF9 translocation with a purely mature leukemia cell hierarchy - pt_17746: NPM1c + DNMT3A + TET2 with a GMP-dominant leukemia cell hierarchy - pt_30886: Complex Karyotype with a primitive leukemia cell hierarchy + extensive erythroid envolvement (acute erythroleukemia)
Each patient was downsampled to 2500 cells to decrease runtime for this example.
query
An object of class Seurat
36601 features across 7500 samples within 1 assay
Active assay: RNA (36601 features, 0 variable features)
Provide raw counts, metadata, and donor key. This should take <1 min Calculate mapping error and perform QC to remove low quality cells with high mapping error
# batch variable to correct in the query data, set as NULL if no batches in query
batchvar <- 'Patient'
# Map query dataset using Symphony (Kang et al 2021)
query <- map_Query(
exp_query = query@assays$RNA@counts,
metadata_query = query@meta.data,
ref_obj = ref,
vars = batchvar
)
Normalizing
Scaling and synchronizing query gene expression
Found 2360 reference variable genes in query dataset
Project query cells using reference gene loadings
Clustering query cells to reference centroids
Correcting query batch effects
UMAP
All done!
Warning: Invalid name supplied, making object name syntactically valid. New object name is Seurat..ProjectDim.RNA.harmony; see ?make.names for more details on syntax validity
In leukemia samples, the distribution of mapping error scores can vary broadly from sample to sample. In this context, we will want to threshold outliers with high mapping error on a per-sample basis. Typically, a threshold of 2, 2.5, or 3 MADs works well.
In some cases where sequencing depth is very low (e.g. older datasets from first-generation scRNA-seq protocols), a more stringent threshold of even 1.5 may be warranted to eliminate cells with low mapping quality
# Run QC based on mapping error score, flag cells with mapping error >= 2.5 MADs above median
query <- query %>% calculate_MappingError(., reference = ref, MAD_threshold = 2.5,
threshold_by_donor = TRUE, donor_key = batchvar) # threshold mapping error on a per-sample basis.
# Plot distribution by patient to ensure you are catching the tail
query@meta.data %>%
ggplot(aes(x = mapping_error_score, fill = mapping_error_QC)) +
geom_histogram(bins = 200) + facet_wrap(.~get(batchvar))
# Get QC Plots
QC_plots <- plot_MappingErrorQC(query)
# Plot together - If this is too crowded, can also just call "QC_plots" aloneto display one by one
patchwork::wrap_plots(QC_plots, ncol = 4, widths = c(0.8, 0.3, 0.8, 0.3))
This important step identifies a subset of cells with high mapping error from the query dataset that are either: 1) not present within the reference, or 2) have poor QC metrics (low RNA counts and low transcriptional diversity)
Sometimes, low quality cells may erroneously map to the orthochromatic erythroblast region as this cell type has very low transcriptional diversity. These low quality query cells do not have hemoglobin expression and are in fact mis-mapped; they will be flagged by the QC filter and excluded from cell type assignments.
Please adjust the MAD.threshold (typically between 1 and 3) based on the distribution of your dataset to identify the outliers with low quality and high mapping error scores. This will improve your classifications and any downstream composition analysis
# # Optional step - remove outliers with high mapping error
# query <- subset(query, mapping_error_QC == 'Pass')
Optionally, outlier cells with high mapping error can also be removed at this stage. For ease of integrating these mapped annotations with the rest of your analysis, we also choose to skip this step. If so, Final CellType and Pseudotime predictions will be assigned as NA for cells failing the mapping error QC threshold.
We will next use a KNN classifier to assign cell identity based on the 30 K-Nearest Neighbours from the reference map. This label transfer step will take longer, potentially around 10 minutes for ~10,000 cells
# Predict Hematopoietic Cell Types by KNN classification
query <- predict_CellTypes(
query_obj = query,
ref_obj = ref,
initial_label = 'initial_CellType', # celltype assignments before mapping QC
final_label = 'predicted_CellType' # celltype assignments with map QC failing cells assigned as NA
)
DimPlot(subset(query, mapping_error_QC == 'Pass'), reduction = 'umap', group.by = c('predicted_CellType'), raster=FALSE, label=TRUE, label.size = 4)
We can also annotate each query cell based on their position along hematopoietic pseudotime. Query cells will be assigned a pseudotime score based on the 30 K-Nearest Neighbours from the reference map. Since our Pseudotime KNN assignments are performed in UMAP space (more accurate than KNN on harmony components), this step is very fast (< 10s)
# Visualize Hematopoietic Pseudotime in query data
FeaturePlot(subset(query, mapping_error_QC == 'Pass'), features = c('predicted_Pseudotime'), split.by = 'Patient') &
scale_color_gradientn(colors = rev(brewer.pal(11, 'RdBu')))
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Now let’s visualize the density distribution of query cells across the hematopoietic hierarchy
We can also set Hierarchy_only = TRUE to remove T/NK/Plasma/Stromal cells and focus solely on the hematopoietic hierarchy.
Here, to study the abundance of each cell type within each donor, I focus on cells that were classified with a KNN prob > 0.5 (that is, >50% of nearest neighbours from the reference map agree on the assigned cell type).
We can present this as a long table
query_composition <- get_Composition(
query_obj = query,
donor_key = 'Patient',
celltype_label = 'predicted_CellType',
mapQC_col = 'mapping_error_QC',
knn_prob_cutoff = 0.5,
return_type = 'long')
query_composition
Or as a wide table with counts of # of cells
query_composition <- get_Composition(
query_obj = query,
donor_key = 'Patient',
celltype_label = 'predicted_CellType',
mapQC_col = 'mapping_error_QC',
knn_prob_cutoff = 0.5,
return_type = 'count')
query_composition
Or as a wide table with proportion of each cell type within each donor
query_composition <- get_Composition(
query_obj = query,
donor_key = 'Patient',
celltype_label = 'predicted_CellType',
mapQC_col = 'mapping_error_QC',
knn_prob_cutoff = 0.5,
return_type = 'proportion')
query_composition
query_composition %>% select(Patient, colnames(query_composition)[-1][colSums(query_composition[-1]) > 0.02])
# Simple heatmap to visualize composition of projected samples
p <- query_composition %>%
# show celltypes present in >1% of total cells
select(Patient, colnames(query_composition)[-1][colSums(query_composition[-1]) > 0.01]) %>%
# convert to matrix and display heatmap
column_to_rownames('Patient') %>% data.matrix() %>% ComplexHeatmap::Heatmap()
p
Here we will load genesets derived from functional studies of LSC+ and LSC- fractions in diverse AML patients (Ng Nature 2017), and LSC+ vs LSC- fractions in patients with MLL translocations (Somervaille Cell Stem Cell 2009). The former tends to correspond to more primitive LSCs while the latter corresponds to a distinct signature of mature promonocytic LSCs.
Further, we will load additional genesets derived from marker genes of leukemia cell populations defined in van Galen et al Cell 2019 and Zeng et al Nature Medicine 2022. Notably, the LSPC-Quiescent population is highly associated with functional LSCs and relapse by deconvolution (Zeng 2022).
curl::curl_download('https://raw.githubusercontent.com/andygxzeng/BoneMarrowMap/main/inst/tutorial/AML_CellType_Genesets.gmt',
destfile = paste0(projection_path, 'AML_CellType_Genesets.gmt'))
AMLgenesets <- load_Genesets_gmt('AML_CellType_Genesets.gmt')
Read 11 items
AMLgenesets %>% summary()
Length Class Mode
LSC104_Ng2016_UP 47 -none- character
LSC104_Ng2016_DOWN 56 -none- character
MLL_LSC_Somervaille2009_UP 176 -none- character
MLL_LSC_Somervaille2009_DOWN 352 -none- character
LSPC_Quiescent 61 -none- character
LSPC_Primed_Top100 100 -none- character
LSPC_Cycle_Top100 100 -none- character
GMP_like_Top100 100 -none- character
ProMono_like_Top100 100 -none- character
Mono_like_Top100 100 -none- character
cDC_like_Top100 100 -none- character
query@meta.data <- query@meta.data %>% select(-contains('AUC'))
# score genesets by AUCell and add to metadata
# typically I split the data into batches of 5k-10k cells to conserve memory
query <- score_Genesets_AUCell(query, genesets = AMLgenesets, nbatches = 2, ncores = 8, output = 'metadata')
[1] "Running AUCell scoring"
[1] "batch 1"
Warning: sparse->dense coercion: allocating vector of size 1.0 GiBQuantiles for the number of genes detected by cell:
(Non-detected genes are shuffled at the end of the ranking. Keep it in mind when choosing the threshold for calculating the AUC).
min 1% 5% 10% 50% 100%
527.00 633.00 775.45 939.90 1926.50 5535.00
Using 8 cores.
Using 8 cores with doMC.
Genes in the gene sets NOT available in the dataset:
LSC104_Ng2016_UP: 1 (2% of 47)
LSC104_Ng2016_DOWN: 6 (11% of 56)
MLL_LSC_Somervaille2009_UP: 9 (6% of 160)
MLL_LSC_Somervaille2009_DOWN: 6 (2% of 285)
LSPC_Quiescent: 6 (10% of 61)
LSPC_Primed_Top100: 9 (9% of 100)
LSPC_Cycle_Top100: 2 (2% of 100)
GMP_like_Top100: 3 (3% of 100)
cDC_like_Top100: 7 (7% of 100)
Garbage collection 73 = 37+7+29 (level 2) ...
503.4 Mbytes of cons cells used (64%)
5832.2 Mbytes of vectors used (31%)
[1] "batch 2"
Warning: sparse->dense coercion: allocating vector of size 1.0 GiBQuantiles for the number of genes detected by cell:
(Non-detected genes are shuffled at the end of the ranking. Keep it in mind when choosing the threshold for calculating the AUC).
min 1% 5% 10% 50% 100%
507.00 581.47 713.45 807.00 1241.50 7032.00
Using 8 cores.
Using 8 cores with doMC.
Genes in the gene sets NOT available in the dataset:
LSC104_Ng2016_UP: 1 (2% of 47)
LSC104_Ng2016_DOWN: 6 (11% of 56)
MLL_LSC_Somervaille2009_UP: 9 (6% of 160)
MLL_LSC_Somervaille2009_DOWN: 6 (2% of 285)
LSPC_Quiescent: 6 (10% of 61)
LSPC_Primed_Top100: 9 (9% of 100)
LSPC_Cycle_Top100: 2 (2% of 100)
GMP_like_Top100: 3 (3% of 100)
cDC_like_Top100: 7 (7% of 100)
Garbage collection 75 = 38+7+30 (level 2) ...
503.4 Mbytes of cons cells used (64%)
5832.5 Mbytes of vectors used (31%)
[1] "Finished Scoring"
query@meta.data
Using genes associated with functional LSC+ engraftment capacity (LSC104_Ng2016_UP) together with the LSPC-Quiescent signature help us to identify candidate LSCs within the data. It is no surprise that these are the most primitive cells earliest in Pseudotime. Note that this approach does not work for the MLL-AF9 samples wherein all leukemic cells are mature myeloid; in this context the MLL_LSC_Somervaille_2009_UP signature may help us identify functional “LSCs” capable of engraftment within the mature myeloid populations (typically late GMPs or Early ProMonocytes).
FeaturePlot(subset(query, mapping_error_QC == 'Pass'), features = c('LSPC_Quiescent_AUC', 'LSC104_Ng2016_UP_AUC'),
split.by = 'Patient', max.cutoff = 'q99', min.cutoff = 'q5') &
scale_color_gradientn(colors = rev(brewer.pal(11, 'RdBu')))
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
This will save a csv file with the mapped annotations for each cell (mapping error scores, umap coordinates, predicted Cell Type, and predicted Pseudotime). We also have the option to save AUCell scores, provided that they are in the metadata.
# Save CellType Annotations and Projected UMAP coordinates
save_ProjectionResults(
query_obj = query,
celltype_label = 'predicted_CellType',
celltype_KNNprob_label = 'predicted_CellType_prob',
pseudotime_label = 'predicted_Pseudotime',
save_AUCell_scores = TRUE,
file_name = 'querydata_projected_labeled.csv')
For downstream analysis, you can use the projected celltype labels to help annotate any leukemia cell clusters generated through unsupervised dimensionality reduction and clustering from individual patients. Sometimes, unsupervised analysis will yield clusters corresponding to different developmental states and other times it may yield clusters corresponding to distinct subclones within the patient. Along with tools like inferCNV for patients with known cytogenetic abnormalities, this can be integrated to visualize cellular hierarchies at the level of individual subclones.
Additional information from pseudotime projection and scoring of LSC-specific signatures can help identify candidate LSCs within the data.
Finally, for cohorts with many patients composition analysis can be performed to understand how leukemia cell hierarchies vary with patient characteristics and therapy response / relapse. I hope this tutorial was useful and feel free to comment with any questions you may have around projection of leukemia cells or downstream analysis.